Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

The estimated mean of a Gaussian distribution with varying sampling sizes.

A breast cancer data set with 569 samples [Wolberg, et al., 1994; … et al., 1995] was also used to demonstrate how the random sampling approach can help reach the real data proportion. Among these samples, 212 were malignant tumours. The malignancy ratio was 0.373, i.e., 37.3% of the tumours in this data set were malignant. Half of the samples were repeatedly drawn from this data set at random. The malignancy ratio was calculated within each drawn sample, with the number of sampling times varying from ten to 1,000. For instance, for K sampling times, K samples were drawn and K malignancy ratios were calculated, one per sample. Afterwards, the mean value of the K malignancy ratios was recorded. It was expected that as the number of random samplings increased, the estimated malignancy ratio within the drawn samples would approach the real ratio, i.e., 0.373. Figure 3.14 shows this simulation. From this plot, it can be seen that as the number of sampling times increased, the estimated malignancy ratio among the randomly drawn samples did indeed approach the real ratio.

The malignancy ratio within sampled data with varying sampling times for the breast cancer data set.
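The simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's code: the real measurements are replaced by a synthetic label vector with the counts given in the text (569 samples, 212 malignant), "half of the samples" is taken as 284 drawn without replacement, and the names `mean_malignancy_ratio`, `N_SAMPLES` and `N_MALIGNANT` are hypothetical.

```python
import random

N_SAMPLES = 569    # size of the data set, as stated in the text
N_MALIGNANT = 212  # number of malignant tumours, as stated in the text


def mean_malignancy_ratio(k, rng):
    """Draw half of the data k times and return the mean malignancy ratio."""
    # Assumption: synthetic labels (1 = malignant, 0 = benign) stand in
    # for the real data set; only the class counts are taken from the text.
    labels = [1] * N_MALIGNANT + [0] * (N_SAMPLES - N_MALIGNANT)
    half = N_SAMPLES // 2  # assumption: "half" means 284 samples
    ratios = []
    for _ in range(k):
        drawn = rng.sample(labels, half)  # sampling without replacement
        ratios.append(sum(drawn) / half)  # malignancy ratio of this draw
    return sum(ratios) / k                # mean of the k ratios


rng = random.Random(0)  # fixed seed so that the run is repeatable
true_ratio = N_MALIGNANT / N_SAMPLES
for k in (10, 100, 1000):
    est = mean_malignancy_ratio(k, rng)
    print(f"K={k:5d}  mean ratio={est:.4f}  |error|={abs(est - true_ratio):.4f}")
```

With a different seed the individual numbers change, but the error for large K should typically be smaller than for small K, which is the behaviour the figure illustrates.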

As discussed above, it can be seen that the random sampling approach needed many repeats to reach a reasonable approximation of the real value of